%%{init: {"flowchart": {"nodeSpacing": 20, "rankSpacing": 40, 'height':1}}}%%
graph LR
subgraph Inputs
direction LR
style Inputs fill:#a7bde0,stroke:#64b5f6,font-size:30,stroke-width:2px
x1[x<sub>1</sub>]
x2[x<sub>2</sub>]
x3[x<sub>3</sub>]
x4[x<sub>4</sub>]
x5[x<sub>5</sub>]
end
subgraph "hidden-layer" ["Hidden Layer"]
direction LR
style hidden-layer fill:#a7e0b3,stroke:#85ff9f,font-size:30,stroke-width:2px
h1(h<sub>1</sub>)
h2(h<sub>2</sub>)
h3(h<sub>3</sub>)
end
subgraph Output
direction LR
style Output fill:#e78383,stroke:#f8cc52,font-size:30,stroke-width:2px
y(y<sub>1</sub>)
end
x1 --> |w<sub>11</sub><sup>1</sup>| h1
x1 --> |w<sub>12</sub><sup>1</sup>| h2
x1 --> |w<sub>13</sub><sup>1</sup>| h3
x2 --> |w<sub>21</sub><sup>1</sup>| h1
x2 --> |w<sub>22</sub><sup>1</sup>| h2
x2 --> |w<sub>23</sub><sup>1</sup>| h3
x3 --> |w<sub>31</sub><sup>1</sup>| h1
x3 --> |w<sub>32</sub><sup>1</sup>| h2
x3 --> |w<sub>33</sub><sup>1</sup>| h3
x4 --> |w<sub>41</sub><sup>1</sup>| h1
x4 --> |w<sub>42</sub><sup>1</sup>| h2
x4 --> |w<sub>43</sub><sup>1</sup>| h3
x5 --> |w<sub>51</sub><sup>1</sup>| h1
x5 --> |w<sub>52</sub><sup>1</sup>| h2
x5 --> |w<sub>53</sub><sup>1</sup>| h3
h1 --> |w<sub>11</sub><sup>2</sup>| y
h2 --> |w<sub>21</sub><sup>2</sup>| y
h3 --> |w<sub>31</sub><sup>2</sup>| y
y --> Out(["Softmax Activation"])
RNN - Recurrent Neural Network
RNN stands for Recurrent Neural Network, which is a type of artificial neural network that processes sequential data. RNNs are used in a variety of applications, such as speech recognition and sentiment analysis.
1. Why can't ANNs be used for sequential data?
Artificial Neural Networks (ANNs) struggle with sequential data due to inherent structural limitations. Here’s a breakdown of the key reasons:
Reason 1. Fixed Input/Output Size Requirement
In real life, sequential data (text, time series, sensor readings) often has variable length. For example:
| Sequence | Size |
|---|---|
| Python is a OOPS programming language | 6 |
| I Love India | 3 |
| I am playing football | 4 |
Suppose you build an ANN like the one shown above, with 5 input nodes and 3 hidden nodes, where each word is represented by a 5-dimensional vector.
Our first sentence contains 6 words, so its input needs a 6 × 5 matrix structure.
The second sentence contains 3 words, so it needs a 3 × 5 structure.
The third sentence contains 4 words, so it needs a 4 × 5 structure.
The shape of the input (and hence of the first weight matrix) changes with every input text, which makes a fixed network design impractical.
Reason 2. Zero Padding Causes Unnecessary Computation
To solve the varying-length issue we can use the zero padding technique.
First, find the sentence with the maximum number of words.
In our case the first sentence is the longest, with 6 words.
So we fix the input size to a maximum of 6 words.
The second sentence has only 3 words, so to reach the fixed length of 6 we append 3 extra vectors filled with zeros.
This is called zero padding.
The problem with zero padding: suppose the longest sentence has 1000 words.
Then we must fix the input length to 1000 nodes.
If a sentence has only 5 words, the remaining 995 positions must be filled with zero padding.
This wastes memory and computation, slows down training, and is generally undesirable. A small sketch of padding in practice follows below.
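A minimal sketch of zero padding in practice, using Keras `pad_sequences` (the token IDs here are made up purely for illustration):

```python
# Minimal zero-padding sketch (illustrative token IDs, not a real vocabulary).
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    [4, 12, 7, 9, 21, 3],   # "Python is a OOPS programming language" -> 6 tokens
    [5, 18, 2],             # "I Love India"                          -> 3 tokens
    [5, 8, 14, 11],         # "I am playing football"                 -> 4 tokens
]

# Pad every sequence to the length of the longest one (6) with trailing zeros.
padded = pad_sequences(sentences, maxlen=6, padding='post', value=0)
print(padded)
# [[ 4 12  7  9 21  3]
#  [ 5 18  2  0  0  0]
#  [ 5  8 14 11  0  0]]
```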
Reason 3. Prediction Fails on a Different Input Length
- In our case, we set the input length to 6 words while training the model.
- At prediction time, suppose we receive a text of 10 words; the model will fail.
- Because the model was trained with a fixed input size of 6 words, it cannot make predictions for 10 words.
Reason 4. Not Considering Sequential Information
- The ANN architecture does not take the order of words in the input text into account.
- When we pass the input text to an ANN, it consumes all the inputs at once.
- Because all values enter the network at the same time, they get mixed up inside it, and the ordering is lost.
- Hence an ANN is not suitable for sequential data.
Reason 5. Lack of Temporal Memory
ANNs process inputs independently, with no mechanism to retain information from previous steps. This makes them unsuitable for tasks requiring context, such as:
Language: The word “lie” means different things in “never tell a lie” vs. “lie down”.
Time series: Predicting stock prices requires historical trends, not just isolated data points.
Example: In the sentence “The cat chased the…”, ANNs cannot retain the context of “cat” to predict “mouse” as the next word.
2. RNN Forward Propagation - Step by Step
In forward propagation of an RNN (Recurrent Neural Network), the network processes input sequences step by step. At each time step \(t\), it takes the current input \(x_t\) and the hidden state from the previous step \(h_{t−1}\), applies weights and activation functions (like tanh), and computes the new hidden state \(h_t\). This hidden state is then used to predict the output \(y_t\).
The process repeats for each time step, allowing the RNN to capture temporal dependencies in sequential data.
1. Problem Setup
We have a dataset of sentences:
| Sequence | Size |
|---|---|
| “Show is nice” | 3 |
| “Show is not nice” | 4 |
| “Show is worst” | 3 |
Each word is represented using One-Hot Encoding (OHE).
2. Network Architecture
- Input Layer: 5 neurons (the 5-dimensional one-hot vector of the current word)
- Hidden Layer: 3 neurons (processing sequential information)
- Output Layer: 1 neuron (final prediction with a sigmoid activation)
3. One-Hot Encoding Representation
Since we have five unique words {“Show”, “is”, “nice”, “worst”, “not”}, each word is a 5-dimensional vector:
| Word | One-Hot Encoding |
|---|---|
| Show | [1, 0, 0, 0, 0] |
| is | [0, 1, 0, 0, 0] |
| nice | [0, 0, 1, 0, 0] |
| worst | [0, 0, 0, 1, 0] |
| not | [0, 0, 0, 0, 1] |
Each word is now represented as a 5-dimensional vector.
4. Defining RNN Parameters
Weight matrices and biases:
- Input-to-Hidden weights \(W_x\): 3 × 5 matrix
- Hidden-to-Hidden weights \(W_h\): 3 × 3 matrix
- Hidden bias \(b\): 3 × 1 vector
- Hidden-to-Output weights \(W_y\): 1 × 3 matrix
- Output bias \(b_y\): 1 × 1 scalar
Weight Matrices
\[ W_x = \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 & 0.5 \\ 0.6 & 0.7 & 0.8 & 0.9 & 1.0 \\ 1.1 & 1.2 & 1.3 & 1.4 & 1.5 \end{bmatrix} \] \[ W_h = \begin{bmatrix} 0.9 & 0.8 & 0.7 \\ 0.6 & 0.5 & 0.4 \\ 0.3 & 0.2 & 0.1 \end{bmatrix} \] \[ b = \begin{bmatrix} 0.1 \\ 0.1 \\ 0.1 \end{bmatrix} \] \[ W_y = \begin{bmatrix} 0.5 & 0.6 & 0.7 \end{bmatrix} \] \[ b_y = \begin{bmatrix} 0.2 \end{bmatrix} \]
5. Forward Propagation Formula
For each time step t:
\[ h_t = \tanh(W_x x_t + W_h h_{t-1} + b) \] \[ y_t = \sigma(W_y h_t + b_y) \]
Here \(\sigma\) is the sigmoid activation, since this example has a single output neuron; with several output classes a softmax would be used instead.
where:
- \(x_t\) = input word (one-hot encoded vector of shape 5 × 1)
- \(h_t\) = hidden state (3 × 1)
- \(y_t\) = output (1 × 1 scalar)
6. Forward Propagation Calculation
Step 1: Processing First Word “Show” (t = 1)
1. Input Vector for “Show”
\[ x_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} \]
2. Detailed calculation of \((W_x \cdot x_1)\):
\[ W_x x_1 = \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 & 0.5 \\ 0.6 & 0.7 & 0.8 & 0.9 & 1.0 \\ 1.1 & 1.2 & 1.3 & 1.4 & 1.5 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} \] \[ = \begin{bmatrix} (0.1 \times 1) + (0.2 \times 0) + (0.3 \times 0) + (0.4 \times 0) + (0.5 \times 0) \\ (0.6 \times 1) + (0.7 \times 0) + (0.8 \times 0) + (0.9 \times 0) + (1.0 \times 0) \\ (1.1 \times 1) + (1.2 \times 0) + (1.3 \times 0) + (1.4 \times 0) + (1.5 \times 0) \end{bmatrix} \] \[ = \begin{bmatrix} 0.1 \\ 0.6 \\ 1.1 \end{bmatrix} \]
3. Detailed calculation of \(W_h \cdot h_0\):
Since \(h_0\) is initialized to zeros: \[ W_h h_0 = \begin{bmatrix} 0.9 & 0.8 & 0.7 \\ 0.6 & 0.5 & 0.4 \\ 0.3 & 0.2 & 0.1 \end{bmatrix} \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \]
4. calculate \(z_1\):
\[ z_1 = W_x x_1 + W_h h_0 + b = \begin{bmatrix} 0.1 \\ 0.6 \\ 1.1 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} 0.1 \\ 0.1 \\ 0.1 \end{bmatrix} = \begin{bmatrix} 0.2 \\ 0.7 \\ 1.2 \end{bmatrix} \]
5. Apply tanh activation
\[ h_1 = \tanh(z_1) = \begin{bmatrix} \tanh(0.2) \\ \tanh(0.7) \\ \tanh(1.2) \end{bmatrix} \approx \begin{bmatrix} 0.198 \\ 0.604 \\ 0.833 \end{bmatrix} \]
Step 2: Processing Second Word “is” (t = 2)
1. Input Vector for “is”
\[ x_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} \]
2. Detailed calculation of \(W_x \cdot x_2\):
\[ W_x x_2 = \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 & 0.5 \\ 0.6 & 0.7 & 0.8 & 0.9 & 1.0 \\ 1.1 & 1.2 & 1.3 & 1.4 & 1.5 \end{bmatrix} \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} \] \[ = \begin{bmatrix} (0.1 \times 0) + (0.2 \times 1) + (0.3 \times 0) + (0.4 \times 0) + (0.5 \times 0) \\ (0.6 \times 0) + (0.7 \times 1) + (0.8 \times 0) + (0.9 \times 0) + (1.0 \times 0) \\ (1.1 \times 0) + (1.2 \times 1) + (1.3 \times 0) + (1.4 \times 0) + (1.5 \times 0) \end{bmatrix} \] \[ = \begin{bmatrix} 0.2 \\ 0.7 \\ 1.2 \end{bmatrix} \]
3. Detailed calculation of \(W_h \cdot h_1\):
\[ W_h h_1 = \begin{bmatrix} 0.9 & 0.8 & 0.7 \\ 0.6 & 0.5 & 0.4 \\ 0.3 & 0.2 & 0.1 \end{bmatrix} \begin{bmatrix} 0.198 \\ 0.604 \\ 0.833 \end{bmatrix} \] \[ = \begin{bmatrix} (0.9 \times 0.198) + (0.8 \times 0.604) + (0.7 \times 0.833) \\ (0.6 \times 0.198) + (0.5 \times 0.604) + (0.4 \times 0.833) \\ (0.3 \times 0.198) + (0.2 \times 0.604) + (0.1 \times 0.833) \end{bmatrix} \] \[ = \begin{bmatrix} 0.178 + 0.483 + 0.583 \\ 0.119 + 0.302 + 0.333 \\ 0.059 + 0.121 + 0.083 \end{bmatrix} = \begin{bmatrix} 1.244 \\ 0.754 \\ 0.263 \end{bmatrix} \]
4. calculate \(z_2\):
\[ z_2 = W_x x_2 + W_h h_1 + b = \begin{bmatrix} 0.2 \\ 0.7 \\ 1.2 \end{bmatrix} + \begin{bmatrix} 1.244 \\ 0.754 \\ 0.263 \end{bmatrix} + \begin{bmatrix} 0.1 \\ 0.1 \\ 0.1 \end{bmatrix} = \begin{bmatrix} 1.544 \\ 1.554 \\ 1.563 \end{bmatrix} \]
5. Apply tanh activation
\[ h_2 = \tanh(z_2) = \begin{bmatrix} \tanh(1.544) \\ \tanh(1.554) \\ \tanh(1.563) \end{bmatrix} \approx \begin{bmatrix} 0.911 \\ 0.914 \\ 0.917 \end{bmatrix} \]
Step 3: Processing Third Word “nice” (t = 3)
1. Input Vector for “nice”
\[ x_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \\ 0 \end{bmatrix} \]
2. Calculate \(W_x \cdot x_3\)
\[ W_x x_3 = \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 & 0.5 \\ 0.6 & 0.7 & 0.8 & 0.9 & 1.0 \\ 1.1 & 1.2 & 1.3 & 1.4 & 1.5 \end{bmatrix} \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \\ 0 \end{bmatrix} \] \[ = \begin{bmatrix} 0.3 \\ 0.8 \\ 1.3 \end{bmatrix} \]
3. Calculate \(W_h \cdot h_2\)
From Step 2, we have: \[ h_2 = \begin{bmatrix} 0.911 \\ 0.914 \\ 0.917 \end{bmatrix} \]
Now compute: \[ W_h h_2 = \begin{bmatrix} 0.9 & 0.8 & 0.7 \\ 0.6 & 0.5 & 0.4 \\ 0.3 & 0.2 & 0.1 \end{bmatrix} \begin{bmatrix} 0.911 \\ 0.914 \\ 0.917 \end{bmatrix} \] \[ = \begin{bmatrix} 2.193 \\ 1.371 \\ 0.548 \end{bmatrix} \]
4. Calculate \(z_3\)
Add the bias: \[ z_3 = W_x x_3 + W_h h_2 + b = \begin{bmatrix} 0.3 \\ 0.8 \\ 1.3 \end{bmatrix} + \begin{bmatrix} 2.193 \\ 1.371 \\ 0.548 \end{bmatrix} + \begin{bmatrix} 0.1 \\ 0.1 \\ 0.1 \end{bmatrix} \] \[ = \begin{bmatrix} 2.593 \\ 2.271 \\ 1.948 \end{bmatrix} \]
5. Apply tanh activation
\[ h_3 = \tanh(z_3) = \begin{bmatrix} \tanh(2.593) \\ \tanh(2.271) \\ \tanh(1.948) \end{bmatrix} \approx \begin{bmatrix} 0.989 \\ 0.979 \\ 0.961 \end{bmatrix} \]
Step 4: Output Calculation for “nice”
1. Calculate \(W_y \cdot h_3\)
\[ W_y h_3 = \begin{bmatrix} 0.5 & 0.6 & 0.7 \end{bmatrix} \begin{bmatrix} 0.989 \\ 0.979 \\ 0.961 \end{bmatrix} \] \[ = 0.495 + 0.587 + 0.673 = 1.755 \]
2. Add output bias
\[ W_y h_3 + b_y = 1.755 + 0.2 = 1.955 \]
3. Apply sigmoid activation
\[ y_{\text{final}} = \frac{1}{1 + e^{-1.955}} \approx 0.876 \]
Summary
- The hidden state is updated at each time step using the input word and the previous hidden state.
- The output is computed from the final hidden state and passed through a sigmoid activation (softmax for multi-class outputs).
- This process can be repeated for any number of words in the sequence.
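The hand calculation above can be reproduced in a few lines of NumPy. This is a minimal sketch (the variable names are ours, not from any framework), using the same parameter values defined in step 4:

```python
import numpy as np

# Parameters from "4. Defining RNN Parameters" above.
Wx = np.array([[0.1, 0.2, 0.3, 0.4, 0.5],
               [0.6, 0.7, 0.8, 0.9, 1.0],
               [1.1, 1.2, 1.3, 1.4, 1.5]])
Wh = np.array([[0.9, 0.8, 0.7],
               [0.6, 0.5, 0.4],
               [0.3, 0.2, 0.1]])
b  = np.array([0.1, 0.1, 0.1])
Wy = np.array([0.5, 0.6, 0.7])
by = 0.2

# One-hot vectors for "Show", "is", "nice".
words = [np.eye(5)[0], np.eye(5)[1], np.eye(5)[2]]

h = np.zeros(3)                                  # h_0 = 0
for t, x_t in enumerate(words, start=1):
    h = np.tanh(Wx @ x_t + Wh @ h + b)           # h_t = tanh(Wx x_t + Wh h_{t-1} + b)
    print(f"h_{t} =", np.round(h, 3))

y = 1 / (1 + np.exp(-(Wy @ h + by)))             # sigmoid on the final hidden state
print("y =", round(float(y), 3))                 # ≈ 0.876
```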
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#dcdede', 'fontSize': '25px', 'textWrapWidth': 200 }, 'viewBox': '0 0 1200 1200' }}%%
graph LR
subgraph "t_0" ["Time step t=0"]
direction TB
style t_0 fill:#9199e1,stroke:#999988,stroke-width:2px,font-size:25px,color:#080b2c
subgraph inputs0 ["Input: Show"]
direction LR
style inputs0 fill:#2e7c92,stroke:#64b5f6,stroke-width:2px
x0["x₀ = [1,0,0,0,0]"]
end
subgraph hidden0 ["Hidden Layer"]
direction LR
style hidden0 fill:#5fb48b,stroke:#85ff9f,stroke-width:2px
h0_1(h_01)
h0_2(h_02)
h0_3(h_03)
end
end
subgraph "t_1" ["Time step t=1"]
direction TB
style t_1 fill:#e19f91,stroke:#999999,stroke-width:2px,font-size:25px,color:#080b2c
subgraph inputs1 ["Input: is"]
direction LR
style inputs1 fill:#2e7c92,stroke:#64b5f6,stroke-width:2px
x1["x₁ = [0,1,0,0,0]"]
end
subgraph hidden1 ["Hidden Layer"]
direction LR
style hidden1 fill:#5fb48b,stroke:#85ff9f,stroke-width:2px
h1_1(h_01)
h1_2(h_02)
h1_3(h_03)
end
end
subgraph "t_2" ["Time step t=2"]
direction TB
style t_2 fill:#c891e1,stroke:#999999,stroke-width:2px,font-size:25px,color:#080b2c
subgraph inputs2 ["Input: nice"]
direction LR
style inputs2 fill:#2e7c92,stroke:#64b5f6,stroke-width:2px
x2["x₂ = [0,0,1,0,0]"]
end
subgraph hidden2 ["Hidden Layer"]
direction LR
style hidden2 fill:#5fb48b,stroke:#85ff9f,stroke-width:2px
h2_1(h_01)
h2_2(h_02)
h2_3(h_03)
end
subgraph output ["Output"]
direction LR
style output fill:#85929e,stroke:#f8cc52,stroke-width:2px
y_final(y)
end
end
%% Input to hidden connections with Wx
x0 -.->|W_x| h0_1
x0 -.->|W_x| h0_2
x0 -.->|W_x| h0_3
x1 -.->|W_x| h1_1
x1 -.->|W_x| h1_2
x1 -.->|W_x| h1_3
x2 -.->|W_x| h2_1
x2 -.->|W_x| h2_2
x2 -.->|W_x| h2_3
%% Recurrent connections with Wh (Red)
h0_1 ===>|W_h| h1_1
h0_2 ===>|W_h| h1_1
h0_3 ===>|W_h| h1_1
h0_1 ===>|W_h| h1_2
h0_2 ===>|W_h| h1_2
h0_3 ===>|W_h| h1_2
h0_1 ===>|W_h| h1_3
h0_2 ===>|W_h| h1_3
h0_3 ===>|W_h| h1_3
h1_1 ===>|W_h| h2_1
h1_2 ===>|W_h| h2_1
h1_3 ===>|W_h| h2_1
h1_1 ===>|W_h| h2_2
h1_2 ===>|W_h| h2_2
h1_3 ===>|W_h| h2_2
h1_1 ===>|W_h| h2_3
h1_2 ===>|W_h| h2_3
h1_3 ===>|W_h| h2_3
%% Hidden to output connections (Red)
h2_1 ==> |W_y| y_final
h2_2 ==> |W_y| y_final
h2_3 ==> |W_y| y_final
y_final --> Out(["Softmax Activation"])
%% Bias connections are implied
A minimal Keras equivalent of the network above (5-dimensional one-hot inputs, 3 recurrent units, 1 sigmoid output):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

model1 = Sequential()
model1.add(SimpleRNN(3, input_shape=(None, 5), activation='tanh'))
model1.add(Dense(1, activation='sigmoid'))
model1.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
model1.summary()
```

Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
simple_rnn_1 (SimpleRNN) (None, 3) 27
dense_1 (Dense) (None, 1) 4
=================================================================
Total params: 31 (124.00 Byte)
Trainable params: 31 (124.00 Byte)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
3. Types of RNN Architecture
3.1 One-to-One
This is the standard neural network (not sequence-based).
Use case: Image classification
e.g.,
Input: An image
Output: A label like “cat” or “dog”
3.2 One-to-Many
A single input produces a sequence of outputs.
Use case: Image captioning
e.g.,
Input: An image
Output: A sequence of words describing the image
3.3 Many-to-One
A sequence of inputs leads to a single output.
Use case: Sentiment analysis
e.g.,
Input: A sentence like “This movie is great”
Output: Sentiment label such as “Positive”
3.4 Many-to-Many (Synchronized)
Each input has a corresponding output at every time step.
Input and output sequences are of the same length.
Use case: Part-of-speech tagging
e.g.,
Input: “The dog barked”
Output: “DET NOUN VERB”
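A Keras-style sketch of this synchronized many-to-many setup (the vocabulary, tag and sequence sizes below are placeholders, not from the original): `return_sequences=True` makes the recurrent layer emit a hidden state at every time step, and `TimeDistributed` applies the same Dense classifier to each step.

```python
# Hypothetical many-to-many (synchronized) tagger: one output tag per input token.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, TimeDistributed, Dense

vocab_size, num_tags, seq_len = 5000, 10, 20       # placeholder sizes

tagger = Sequential([
    Embedding(input_dim=vocab_size, output_dim=32, input_length=seq_len),
    SimpleRNN(64, return_sequences=True),          # emit a hidden state at every step
    TimeDistributed(Dense(num_tags, activation='softmax'))  # one tag per step
])
tagger.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```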
3.5 Many-to-Many (Unaligned / Encoder-Decoder)
Input and output sequences are of different lengths.
Uses an encoder to read input, and a decoder to generate output.
Use case: Machine translation
e.g.,
Input: Let’s Run.
Output: Chalo bhagte hain.
4. Backpropagation Through Time (BPTT) for a Simple RNN
%%{init: {"theme":"neutral", "flowchart": {"nodeSpacing": 30, "rankSpacing": 60}}}%%
flowchart LR
%% Time Step Labels
T0["<b>t = 1</b>"]:::time --> A1
T1["<b>t = 2</b>"]:::time --> A2
T2["<b>t = 3</b>"]:::time --> A3
%% Forward flow
A0["O<sub>0</sub><br><i>h₀ = 0</i>"]:::init --> A1["O<sub>1</sub><br><i>a₁ = tanh(W<sub>xh</sub>x₁ + W<sub>hh</sub>h₀ + b)</i>"]:::cell --> A2["O<sub>2</sub><br><i>a₂ = tanh(W<sub>xh</sub>x₂ + W<sub>hh</sub>a₁ + b)</i>"]:::cell --> A3["O<sub>3</sub><br><i>a₃ = tanh(W<sub>xh</sub>x₃ + W<sub>hh</sub>a₂ + b)</i>"]:::cell --> Y["ŷ = sigmoid(W<sub>hy</sub>a₃ + b)"]:::output --> L["Loss<br>L(ŷ, y)"]:::loss
%% Inputs
X1["x₁ ∈ ℝ²"]:::input --> A1
X2["x₂ ∈ ℝ²"]:::input --> A2
X3["x₃ ∈ ℝ²"]:::input --> A3
%% Weight dimension notes
A1 -.-> Wxh["W<sub>xh</sub> ∈ ℝ<sup>2×2</sup>"]:::weight
A1 -.-> Whh["W<sub>hh</sub> ∈ ℝ<sup>2×2</sup>"]:::weight
A3 -.-> Why["W<sub>hy</sub> ∈ ℝ<sup>1×2</sup>"]:::weight
%% Backward flow (backprop)
L -.-> Yb["∂L/∂ŷ"]:::grad --> A3b["∂L/∂a₃"]:::grad --> A2b["∂L/∂a₂"]:::grad --> A1b["∂L/∂a₁"]:::grad
%% Dashed arrows for gradients
L -.-> Yb
Yb -.-> A3b
A3b -.-> A2b
A2b -.-> A1b
%% Styling
classDef cell fill:#b2fab4,stroke:#333,stroke-width:1px;
classDef input fill:#f9caca,stroke:#333,stroke-width:1px;
classDef output fill:#cce5ff,stroke:#333,stroke-width:1px;
classDef loss fill:#ffd9a0,stroke:#333,stroke-width:1px;
classDef weight fill:#f2f2f2,stroke:#bbb,stroke-dasharray: 4 2;
classDef grad fill:#ffe0e0,stroke:#cc0000,stroke-dasharray: 5 3;
classDef time fill:#f0f0f0,stroke:#999,stroke-dasharray: 5 5;
classDef init fill:#eeeeee,stroke:#999,stroke-dasharray: 2 2;
class T0,T1,T2 time
class A0 init
class A1,A2,A3 cell
class X1,X2,X3 input
class Y output
class L loss
class Wxh,Whh,Why weight
class Yb,A3b,A2b,A1b grad
%%{init: {"theme":"neutral", "flowchart": {"nodeSpacing": 30, "rankSpacing": 60}}}%%
flowchart LR
%% Time Steps
T0("at <b>t = 1</b>"):::time --> E3{{"O<sub>1</sub><br><i>tanh</i>"}}:::cell
T1("at <b>t = 2</b>"):::time --> D3{{"O<sub>2</sub><br><i>tanh</i>"}}:::cell
T2("at <b>t = 3</b>"):::time --> C1{{"<b>O<sub>3</sub></b><br><i>tanh</i>"}}:::cell
%% Output and Loss
A["<b>L</b><br><i>(Loss)</i>"]:::loss --> B["<b>y<sub>pred</sub></b><br><i>sigmoid</i>"]:::output
B --> C2(["<b>W<sub>final</sub></b><br><i>(hidden → output)</i>"]):::weight
B --> BY["<b>b<sub>y</sub></b>"]:::bias
B --> C1
%% Layer t=2
C1 --> D1(["X<sub>3</sub>"]):::input
C1 --> D4(["W<sub>h</sub><br>(h → h)"]):::weight
C1 --> D2(["W<sub>input</sub><br>(x → h)"]):::weight
C1 --> D3
C1 --> BH["<b>b<sub>h</sub></b>"]:::bias
%% Layer t=1
D3 --> E1(["X<sub>2</sub>"]):::input
D3 --> E4(["W<sub>h</sub><br>(h → h)"]):::weight
D3 --> E2(["W<sub>input</sub><br>(x → h)"]):::weight
D3 --> E3
D3 --> BH2["b<sub>h</sub>"]:::bias
%% Layer t=0
E3 --> F1(["X<sub>1</sub>"]):::input
E3 --> F2(["W<sub>input</sub><br>(x → h)"]):::weight
E3 --> F3{{"O<sub>0</sub><br><i>h₀ = 0</i>"}}:::init
E3 --> F4(["W<sub>h</sub><br>(h → h)"]):::weight
E3 --> BH3["b<sub>h</sub>"]:::bias
%% Class styling
classDef cell fill:#b2fab4,stroke:#333,stroke-width:1px
classDef input fill:#f9caca,stroke:#333,stroke-width:1px
classDef output fill:#cce5ff,stroke:#333,stroke-width:1px
classDef weight fill:#e0e0e0,stroke:#333,stroke-width:1px
classDef loss fill:#ffd9a0,stroke:#333,stroke-width:1px
classDef time fill:#f0f0f0,stroke:#999,stroke-dasharray: 5 5
classDef init fill:#eeeeee,stroke:#999,stroke-dasharray: 2 2
classDef bias fill:#fff2cc,stroke:#333,stroke-width:1px
%% Individual node styles
style A stroke:#D50000,fill:#D50000,color:#FFFFFF
style B fill:#FFFFFF
style C2 fill:#FFE0B2
style D2 fill:#BBDEFB
style E2 stroke:#000000,fill:#BBDEFB
style F2 fill:#BBDEFB,stroke:#424242
Below is a complete derivation of the backpropagation algorithm for a simple Recurrent Neural Network (RNN), using a 3-dimensional one-hot encoded input sequence over 3 time steps.
4.1. RNN Architecture and Parameters
Let:
- Input dimension: \(d = 3\)
- Hidden size: \(h = 2\)
- Output size: \(1\)
- Timesteps: \(T = 3\)
Parameters:
- \(W_{\text{input}} \in \mathbb{R}^{2 \times 3}\): weights from input to hidden layer
- \(W_h \in \mathbb{R}^{2 \times 2}\): recurrent weights (hidden to hidden)
- \(\mathbf{b}_h \in \mathbb{R}^{2 \times 1}\): hidden bias
- \(W_{\text{final}} \in \mathbb{R}^{1 \times 2}\): weights from hidden to output
- \(b_y \in \mathbb{R}\): output bias
4.2. One-Hot Encoded Input Sequence
The input sequence \(\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3 \in \mathbb{R}^3\) is one-hot encoded:
| Time Step | Input Vector \(\mathbf{x}_t\) |
|---|---|
| \(t = 1\) | \(\mathbf{x}_1 = [1, 0, 0]\) |
| \(t = 2\) | \(\mathbf{x}_2 = [0, 1, 0]\) |
| \(t = 3\) | \(\mathbf{x}_3 = [0, 0, 1]\) |
These represent categorical input tokens mapped to one-hot vectors.
4.3. Forward Pass Equations
- Hidden state initialization:
\[ \mathbf{O}_0 = \mathbf{0} \in \mathbb{R}^2 \]
- For each \(t = 1, 2, 3\):
\[ \mathbf{a}_t = W_{\text{input}} \cdot \mathbf{x}_t + W_h \cdot \mathbf{O}_{t-1} + \mathbf{b}_h \]
\[ \mathbf{O}_t = \tanh(\mathbf{a}_t) \]
such that,
\(\mathbf{a}_1 = W_{\text{input}} \cdot \mathbf{x}_1 + W_h \cdot \mathbf{O}_{0} + \mathbf{b}_h\)
\(\mathbf{O}_1 = \tanh(\mathbf{a}_1)\)
\(\mathbf{a}_2 = W_{\text{input}} \cdot \mathbf{x}_2 + W_h \cdot \mathbf{O}_{1} + \mathbf{b}_h\)
\(\mathbf{O}_2 = \tanh(\mathbf{a}_2)\)
\(\mathbf{a}_3 = W_{\text{input}} \cdot \mathbf{x}_3 + W_h \cdot \mathbf{O}_{2} + \mathbf{b}_h\)
\(\mathbf{O}_3 = \tanh(\mathbf{a}_3)\)
- Output computation (only after final hidden state): \[ z_f = W_{\text{final}} \cdot \mathbf{O}_3 + b_y \] \[ \hat{y} = \sigma(z_f) = \sigma(W_{\text{final}} \cdot \mathbf{O}_3 + b_y) \]
where, \(\sigma(z_f) = \frac{1}{1 + e^{-z_f}}\) is the sigmoid activation.
Gradient flow in BPTT visualization
%%{init: {"theme":"neutral", "flowchart": {"nodeSpacing": 30, "rankSpacing": 60}}}%%
flowchart LR
T0["at <b>t = 1</b>"] --> E3{{"O<sub>1</sub><br><i>tanh</i>"}}
T1["at <b>t = 2</b>"] --> D3{{"O<sub>2</sub><br><i>tanh</i>"}}
T2["at <b>t = 3</b>"] --> C1{{"<b>O<sub>3</sub></b><br><i>tanh</i>"}}
A["<b>L</b><br><i>(Loss)</i>"] --> B["<b>y<sub>pred</sub></b><br><i>sigmoid</i>"]
B --> C2(["<b>W<sub>final</sub></b><br><i>(hidden → output)</i>"]) & BY["<b>b<sub>y</sub></b>"] & C1
C1 --> D1(["X<sub>3</sub>"]) & D2(["W<sub>input</sub><br>(x → h)"]) & D3 & D4(["W<sub>h</sub><br>(h → h)"]) & BH["<b>b<sub>h</sub></b>"]
D3 --> E1(["X<sub>2</sub>"]) & E2(["W<sub>input</sub><br>(x → h)"]) & E3 & E4(["W<sub>h</sub><br>(h → h)"]) & BH2["b<sub>h</sub>"]
E3 --> F1(["X<sub>1</sub>"]) & F2(["W<sub>input</sub><br>(x → h)"]) & F3{{"O<sub>0</sub><br><i>h₀ = 0</i>"}} & F4(["W<sub>h</sub><br>(h → h)"]) & BH3["b<sub>h</sub>"]
A -. "∂L/∂y<sub>pred</sub>" .-> B
B -. "∂y<sub>pred</sub>/∂W<sub>final</sub>" .-> C2
B -. "∂y<sub>pred</sub>/∂b<sub>y</sub>" .-> BY
B -. "∂y<sub>pred</sub>/∂O₃" .-> C1
C1 -. "∂O₃/∂W<sub>h</sub>" .-> D4
C1 -. "∂O₃/∂W<sub>input</sub>" .-> D2
C1 -. "∂O₃/∂b<sub>h</sub>" .-> BH
C1 -. "∂O₃/∂O₂" .-> D3
D3 -. "∂O₂/∂W<sub>h</sub>" .-> E4
D3 -. "∂O₂/∂W<sub>input</sub>" .-> E2
D3 -. "∂O₂/∂b<sub>h</sub> ".-> BH2
D3 -. "∂O₂/∂O₁" .-> E3
E3 -. "∂O₁/∂W<sub>h</sub>" .-> F4
E3 -. "∂O₁/∂W<sub>input</sub>" .-> F2
E3 -. "∂O₁/∂b<sub>h</sub> ".-> BH3
T0:::time
T0:::time
E3:::cell
T1:::time
T1:::time
D3:::cell
T2:::time
T2:::time
C1:::cell
A:::loss
B:::output
C2:::weight
BY:::bias
D1:::input
D2:::weight
D4:::weight
BH:::bias
E1:::input
E2:::weight
E4:::weight
BH2:::bias
F1:::input
F2:::weight
F3:::init
F3:::init
F4:::weight
BH3:::bias
classDef cell fill:#b2fab4,stroke:#333,stroke-width:1px
classDef input fill:#f9caca,stroke:#333,stroke-width:1px
classDef output fill:#cce5ff,stroke:#333,stroke-width:1px
classDef weight fill:#e0e0e0,stroke:#333,stroke-width:1px
classDef loss fill:#ffd9a0,stroke:#333,stroke-width:1px
classDef time fill:#f0f0f0,stroke:#999,stroke-dasharray: 5 5
classDef init fill:#eeeeee,stroke:#999,stroke-dasharray: 2 2
classDef bias fill:#fff2cc,stroke:#333,stroke-width:1px
style A fill:#D50000,color:#FFFFFF
style B fill:#FFFFFF
style C2 fill:#FFE0B2
style D2 fill:#BBDEFB
style E2 fill:#BBDEFB
style F2 fill:#BBDEFB
4.4. Loss Function (Binary Cross Entropy)
Given target \(y \in \{0, 1\}\):
\[ \mathcal{L} = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})] \]
4.5. Backward Pass: Output Layer Gradients
4.5.1 Start by computing the gradient with respect to output:
\(\displaystyle\Rightarrow \frac{\partial \mathcal{L}}{\partial \hat{y}} = \frac{\partial}{\partial \hat{y}} \left( - y \log(\hat{y}) - (1 - y)\log(1 - \hat{y}) \right) = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}\)
\(\displaystyle\Rightarrow \frac{\partial \mathcal{L}}{\partial z_f} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{d\hat{y}}{dz_f} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \cdot \hat{y}(1 - \hat{y}) = (\hat{y} - y)\)
4.5.2 Gradient w.r.t. output layer parameters:
\(\displaystyle\Rightarrow\frac{\partial \mathcal{L}}{\partial W_{\text{final}}} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{d\hat{y}}{dz_f} \cdot \frac{\partial z_f}{\partial W_{\text{final}}} = \frac{\partial \mathcal{L}}{\partial z_f} \cdot \frac{\partial z_f}{\partial W_{\text{final}}} = (\hat{y} - y) \cdot \mathbf{O}_3\)
\(\displaystyle\Rightarrow\frac{\partial \mathcal{L}}{\partial b_y} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{d\hat{y}}{dz_f} \cdot \frac{\partial z_f}{\partial b_y} = (\hat{y} - y)\)
\(\displaystyle\Rightarrow\frac{\partial \mathcal{L}}{\partial \mathbf{O}_3} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{d\hat{y}}{dz_f} \cdot \frac{\partial z_f}{\partial \mathbf{O}_3} = (\hat{y} - y) \cdot W_{\text{final}}\)
4.6. Full Gradient Derivations (Chain Rule)
We will now expand the gradients of \(\mathcal{L}\) w.r.t. \(W_{\text{input}}\), \(W_h\), and \(\mathbf{b}_h\) fully via chain rule for all 3 timesteps.
4.6.1. Gradient w.r.t. \(W_{\text{input}}\)
At \(t = 3\):
\(\displaystyle\Rightarrow\frac{\partial \mathcal{L}}{\partial W_{\text{input}}} \text{ from (t=3)} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_f} \cdot \frac{\partial z_f}{\partial \mathbf{O}_3} \cdot \frac{\partial \mathbf{O}_3}{\partial \mathbf{a}_3} \cdot \frac{\partial \mathbf{a}_3}{\partial W_{\text{input}}}\)
\(\hspace{5.3cm} = (\hat{y} - y) \cdot W_{\text{final}} \cdot (1 - \mathbf{O}_3^2) \cdot \mathbf{x}_3\)
At \(t = 2\):
\(\displaystyle\Rightarrow\frac{\partial \mathcal{L}}{\partial W_{\text{input}}} \text{ from (t=2)} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_f} \cdot \frac{\partial z_f}{\partial \mathbf{O}_3} \cdot \frac{\partial \mathbf{O}_3}{\partial \mathbf{a}_3} \cdot \frac{\partial \mathbf{a}_3}{\partial \mathbf{\mathbf{O}}_2}\cdot \frac{\partial \mathbf{O}_2}{\partial \mathbf{a}_2}\cdot \frac{\partial \mathbf{a}_2}{\partial W_{\text{input}}}\)
\(\hspace{5.3cm} = (\hat{y} - y) \cdot W_{\text{final}} \cdot (1 - \mathbf{O}_3^2) \cdot W_h \cdot (1 - \mathbf{O}_2^2) \cdot \mathbf{x}_2\)
At \(t = 1\):
\(\displaystyle\Rightarrow\frac{\partial \mathcal{L}}{\partial W_{\text{input}}} \text{ from (t=1)} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_f} \cdot \frac{\partial z_f}{\partial \mathbf{O}_3} \cdot \frac{\partial \mathbf{O}_3}{\partial \mathbf{a}_3} \cdot \frac{\partial \mathbf{a}_3}{\partial \mathbf{\mathbf{O}}_2}\cdot \frac{\partial \mathbf{O}_2}{\partial \mathbf{a}_2}\cdot \frac{\partial \mathbf{a}_2}{\partial \mathbf{\mathbf{O}}_1}\cdot \frac{\partial \mathbf{O}_1}{\partial \mathbf{a}_1}\cdot \frac{\partial \mathbf{a}_1}{\partial W_{\text{input}}}\)
\(\hspace{5.3cm}= (\hat{y} - y) \cdot W_{\text{final}} \cdot (1 - \mathbf{O}_3^2) \cdot W_h \cdot (1 - \mathbf{O}_2^2) \cdot W_h \cdot (1 - \mathbf{O}_1^2) \cdot \mathbf{x}_1\)
Putting all together:
\(\displaystyle\Rightarrow\frac{\partial \mathcal{L}}{\partial W_{\text{input}}} = (\hat{y} - y) \cdot W_{\text{final}} \cdot (1 - \mathbf{O}_3^2) \left[ \mathbf{x}_3 + W_h \cdot (1 - \mathbf{O}_2^2) \cdot \mathbf{x}_2 + W_h \cdot (1 - \mathbf{O}_2^2) \cdot W_h \cdot (1 - \mathbf{O}_1^2) \cdot \mathbf{x}_1 \right]\)
In general form:
\[ \frac{\partial \mathcal{L}}{\partial W_{\text{input}}} = \sum_{t=1}^{T} \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_f} \cdot \frac{\partial z_f}{\partial \mathbf{O}_T}\cdot \frac{{\partial \mathbf{O}_T}}{\partial \mathbf{a}_T}\cdot \left[ \prod_{k=t+1}^{T} \left( \frac{\partial \mathbf{a}_k}{\partial \mathbf{O}_{k-1}} \cdot \frac{\partial \mathbf{O}_{k-1}}{\partial \mathbf{a}_{k-1}} \right) \right] \cdot \frac{\partial \mathbf{a}_t}{\partial W_{\text{input}}} \]
4.6.2. Gradient w.r.t. \(W_{h}\)
At \(t = 3\):
\(\displaystyle\Rightarrow\frac{\partial \mathcal{L}}{\partial W_{h}} \text{ from (t=3)} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_f} \cdot \frac{\partial z_f}{\partial \mathbf{O}_3} \cdot \frac{\partial \mathbf{O}_3}{\partial \mathbf{a}_3} \cdot \frac{\partial \mathbf{a}_3}{\partial W_{h}}\)
\(\hspace{4.7cm} = (\hat{y} - y) \cdot W_{\text{final}} \cdot (1 - \mathbf{O}_3^2) \cdot \mathbf{O}_2\)
At \(t = 2\):
\(\displaystyle\Rightarrow\frac{\partial \mathcal{L}}{\partial W_{h}} \text{ from (t=2)} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_f} \cdot \frac{\partial z_f}{\partial \mathbf{O}_3} \cdot \frac{\partial \mathbf{O}_3}{\partial \mathbf{a}_3} \cdot \frac{\partial \mathbf{a}_3}{\partial \mathbf{\mathbf{O}}_2}\cdot \frac{\partial \mathbf{O}_2}{\partial \mathbf{a}_2}\cdot \frac{\partial \mathbf{a}_2}{\partial W_{h}}\)
\(\hspace{4.7cm} = (\hat{y} - y) \cdot W_{\text{final}} \cdot (1 - \mathbf{O}_3^2) \cdot W_h \cdot (1 - \mathbf{O}_2^2) \cdot \mathbf{O}_1\)
At \(t = 1\):
\(\displaystyle\Rightarrow\frac{\partial \mathcal{L}}{\partial W_{h}} \text{ from (t=1)} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_f} \cdot \frac{\partial z_f}{\partial \mathbf{O}_3} \cdot \frac{\partial \mathbf{O}_3}{\partial \mathbf{a}_3} \cdot \frac{\partial \mathbf{a}_3}{\partial \mathbf{\mathbf{O}}_2}\cdot \frac{\partial \mathbf{O}_2}{\partial \mathbf{a}_2}\cdot \frac{\partial \mathbf{a}_2}{\partial \mathbf{\mathbf{O}}_1}\cdot \frac{\partial \mathbf{O}_1}{\partial \mathbf{a}_1}\cdot \frac{\partial \mathbf{a}_1}{\partial W_{h}}\)
\(\hspace{4.7cm} = (\hat{y} - y) \cdot W_{\text{final}} \cdot (1 - \mathbf{O}_3^2) \cdot W_h \cdot (1 - \mathbf{O}_2^2) \cdot W_h \cdot (1 - \mathbf{O}_1^2) \cdot \mathbf{O}_0\)
Putting all together:
\(\displaystyle\Rightarrow\frac{\partial \mathcal{L}}{\partial W_{h}} = (\hat{y} - y) \cdot W_{\text{final}} \cdot (1 - \mathbf{O}_3^2) \left[ \mathbf{O}_2 + W_h \cdot (1 - \mathbf{O}_2^2) \cdot \mathbf{O}_1 + W_h \cdot (1 - \mathbf{O}_2^2) \cdot W_h \cdot (1 - \mathbf{O}_1^2) \cdot \mathbf{O}_0 \right]\)
In general form:
\[ \frac{\partial \mathcal{L}}{\partial W_{h}} = \sum_{t=1}^{T} \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_f} \cdot \frac{\partial z_f}{\partial \mathbf{O}_T}\cdot \frac{{\partial \mathbf{O}_T}}{\partial \mathbf{a}_T}\cdot \left[ \prod_{k=t+1}^{T} \left( \frac{\partial \mathbf{a}_k}{\partial \mathbf{O}_{k-1}} \cdot \frac{\partial \mathbf{O}_{k-1}}{\partial \mathbf{a}_{k-1}} \right) \right] \cdot \frac{\partial \mathbf{a}_t}{\partial W_{h}} \]
4.6.3. Gradient w.r.t. \(b_{h}\)
At \(t = 3\):
\(\displaystyle\Rightarrow\frac{\partial \mathcal{L}}{\partial b_{h}} \text{ from (t=3)} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_f} \cdot \frac{\partial z_f}{\partial \mathbf{O}_3} \cdot \frac{\partial \mathbf{O}_3}{\partial \mathbf{a}_3} \cdot \frac{\partial \mathbf{a}_3}{\partial b_{h}}\)
\(\hspace{4.4cm} = (\hat{y} - y) \cdot W_{\text{final}} \cdot (1 - \mathbf{O}_3^2)\)
At \(t = 2\):
\(\displaystyle\Rightarrow\frac{\partial \mathcal{L}}{\partial b_{h}} \text{ from (t=2)} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_f} \cdot \frac{\partial z_f}{\partial \mathbf{O}_3} \cdot \frac{\partial \mathbf{O}_3}{\partial \mathbf{a}_3} \cdot \frac{\partial \mathbf{a}_3}{\partial \mathbf{\mathbf{O}}_2}\cdot \frac{\partial \mathbf{O}_2}{\partial \mathbf{a}_2}\cdot \frac{\partial \mathbf{a}_2}{\partial b_{h}}\)
\(\hspace{4.4cm} = (\hat{y} - y) \cdot W_{\text{final}} \cdot (1 - \mathbf{O}_3^2) \cdot W_h \cdot (1 - \mathbf{O}_2^2)\)
At \(t = 1\):
\(\displaystyle\Rightarrow\frac{\partial \mathcal{L}}{\partial b_{h}} \text{ from (t=1)} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_f} \cdot \frac{\partial z_f}{\partial \mathbf{O}_3} \cdot \frac{\partial \mathbf{O}_3}{\partial \mathbf{a}_3} \cdot \frac{\partial \mathbf{a}_3}{\partial \mathbf{\mathbf{O}}_2}\cdot \frac{\partial \mathbf{O}_2}{\partial \mathbf{a}_2}\cdot \frac{\partial \mathbf{a}_2}{\partial \mathbf{\mathbf{O}}_1}\cdot \frac{\partial \mathbf{O}_1}{\partial \mathbf{a}_1}\cdot \frac{\partial \mathbf{a}_1}{\partial b_{h}}\)
\(\hspace{4.4cm} = (\hat{y} - y) \cdot W_{\text{final}} \cdot (1 - \mathbf{O}_3^2) \cdot W_h \cdot (1 - \mathbf{O}_2^2) \cdot W_h \cdot (1 - \mathbf{O}_1^2)\)
Putting all together:
\(\displaystyle\Rightarrow\frac{\partial \mathcal{L}}{\partial b_{h}} = (\hat{y} - y) \cdot W_{\text{final}} \cdot (1 - \mathbf{O}_3^2) \left[ 1 + W_h \cdot (1 - \mathbf{O}_2^2) + W_h \cdot (1 - \mathbf{O}_2^2) \cdot W_h \cdot (1 - \mathbf{O}_1^2) \right]\)
In general form:
\[ \frac{\partial \mathcal{L}}{\partial b_{h}} = \sum_{t=1}^{T} \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_f} \cdot \frac{\partial z_f}{\partial \mathbf{O}_T}\cdot \frac{{\partial \mathbf{O}_T}}{\partial \mathbf{a}_T}\cdot \left[ \prod_{k=t+1}^{T} \left( \frac{\partial \mathbf{a}_k}{\partial \mathbf{O}_{k-1}} \cdot \frac{\partial \mathbf{O}_{k-1}}{\partial \mathbf{a}_{k-1}} \right) \right] \cdot \frac{\partial \mathbf{a}_t}{\partial b_{h}} \]
4.7. Final Weight Updates (Gradient Descent)
Using learning rate \(\eta\), we update all parameters:
\(\displaystyle\Rightarrow W_{\text{input}} := W_{\text{input}} - \eta \cdot \frac{\partial \mathcal{L}}{\partial W_{\text{input}}}\)
\(\displaystyle\Rightarrow W_h := W_h - \eta \cdot \frac{\partial \mathcal{L}}{\partial W_h}\)
\(\displaystyle\Rightarrow \mathbf{b}_h := \mathbf{b}_h - \eta \cdot \frac{\partial \mathcal{L}}{\partial \mathbf{b}_h}\)
\(\displaystyle\Rightarrow W_{\text{final}} := W_{\text{final}} - \eta \cdot \frac{\partial \mathcal{L}}{\partial W_{\text{final}}}\)
\(\displaystyle\Rightarrow b_y := b_y - \eta \cdot \frac{\partial \mathcal{L}}{\partial b_y}\)
This is known as Backpropagation Through Time (BPTT) for Many-to-One RNN.
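A small NumPy sketch of the many-to-one case derived above, checking the analytic \(\partial\mathcal{L}/\partial W_h\) against a central finite-difference estimate. Shapes follow Section 4.1 (d = 3, hidden size 2, T = 3); the random parameter values, the target y = 1, and the helper names are our own assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h_size, T = 3, 2, 3                                 # dimensions from Section 4.1

W_in  = rng.normal(scale=0.5, size=(h_size, d))        # W_input  (2 x 3)
W_h   = rng.normal(scale=0.5, size=(h_size, h_size))   # W_h      (2 x 2)
b_h   = np.zeros((h_size, 1))
W_out = rng.normal(scale=0.5, size=(1, h_size))        # W_final  (1 x 2)
b_y   = 0.0
xs = [np.eye(d)[[t]].T for t in range(T)]              # one-hot x_1..x_3 as columns
y_true = 1.0                                           # arbitrary binary target

def forward(Wh):
    """Forward pass with a given recurrent weight matrix; returns (loss, states, y_hat)."""
    O = [np.zeros((h_size, 1))]                        # O_0 = 0
    for x in xs:
        O.append(np.tanh(W_in @ x + Wh @ O[-1] + b_h))
    y_hat = float(1 / (1 + np.exp(-(W_out @ O[-1] + b_y))))
    loss = -(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))
    return loss, O, y_hat

loss, O, y_hat = forward(W_h)

# BPTT for W_h, following the chain-rule expressions derived above.
dz  = y_hat - y_true                                   # dL/dz_f
dO  = W_out.T * dz                                     # dL/dO_3
dWh = np.zeros_like(W_h)
for t in range(T, 0, -1):
    da   = dO * (1 - O[t] ** 2)                        # dL/da_t (tanh derivative)
    dWh += da @ O[t - 1].T                             # contribution of time step t
    dO   = W_h.T @ da                                  # propagate gradient to O_{t-1}

# Central finite-difference check on one entry of W_h.
eps = 1e-6
Wp, Wm = W_h.copy(), W_h.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
numeric = (forward(Wp)[0] - forward(Wm)[0]) / (2 * eps)
print("analytic:", round(dWh[0, 0], 6), " numeric:", round(float(numeric), 6))
```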
5. Many-to-Many RNN with Backpropagation
%%{init: {"theme":"neutral", "flowchart": {"nodeSpacing": 40, "rankSpacing": 70}}}%%
flowchart LR
%% Time Step Labels
T0["<b>t = 0</b>"]:::time --> A1
T1["<b>t = 1</b>"]:::time --> A2
T2["<b>t = 2</b>"]:::time --> A3
%% Forward pass: Hidden states
A0["O<sub>0</sub><br><i>h₀ = 0</i>"]:::init --> A1["O<sub>1</sub><br>a₁ = tanh(...) "]:::cell --> A2["O<sub>2</sub><br>a₂ = tanh(...) "]:::cell --> A3["O<sub>3</sub><br>a₃ = tanh(...) "]:::cell
%% Inputs
X1["x₁ ∈ ℝ²"]:::input --> A1
X2["x₂ ∈ ℝ²"]:::input --> A2
X3["x₃ ∈ ℝ²"]:::input --> A3
%% Outputs at each time step
A1 --> Y1["ŷ₁ = sigmoid(W<sub>hy</sub>a₁)"]:::output --> L1["L₁ = L(ŷ₁, y₁)"]:::loss
A2 --> Y2["ŷ₂ = sigmoid(W<sub>hy</sub>a₂)"]:::output --> L2["L₂ = L(ŷ₂, y₂)"]:::loss
A3 --> Y3["ŷ₃ = sigmoid(W<sub>hy</sub>a₃)"]:::output --> L3["L₃ = L(ŷ₃, y₃)"]:::loss
%% Total Loss
L1 --> SumL["<b>Total Loss:</b><br>L = L₁ + L₂ + L₃"]:::loss
L2 --> SumL
L3 --> SumL
%% Gradients flowing back
L1 -.-> dY1["∂L₁/∂ŷ₁"]:::grad -.-> dA1["∂L₁/∂a₁"]:::grad
L2 -.-> dY2["∂L₂/∂ŷ₂"]:::grad -.-> dA2["∂L₂/∂a₂"]:::grad
L3 -.-> dY3["∂L₃/∂ŷ₃"]:::grad -.-> dA3["∂L₃/∂a₃"]:::grad
%% Time-distributed gradients into hidden layers
dA3 -.-> A2
dA2 -.-> A1
dA1 -.-> A0
%% Styling
classDef cell fill:#b2fab4,stroke:#333,stroke-width:1px;
classDef input fill:#f9caca,stroke:#333,stroke-width:1px;
classDef output fill:#cce5ff,stroke:#333,stroke-width:1px;
classDef loss fill:#ffd9a0,stroke:#333,stroke-width:1px;
classDef grad fill:#ffe0e0,stroke:#cc0000,stroke-dasharray: 5 3;
classDef time fill:#f0f0f0,stroke:#999,stroke-dasharray: 5 5;
classDef init fill:#eeeeee,stroke:#999,stroke-dasharray: 2 2;
class T0,T1,T2 time
class A0 init
class A1,A2,A3 cell
class X1,X2,X3 input
class Y1,Y2,Y3 output
class L1,L2,L3,SumL loss
class dY1,dY2,dY3,dA1,dA2,dA3 grad
Let's now expand all the equations step by step for a many-to-many RNN with T = 3 time steps.
We’ll use:
- Hidden size = \(H\)
- Input size = \(I\)
- Output size = \(O\)
For simplicity, assume scalar output and 1D hidden states, so we can avoid matrix notation and focus on clarity.
🔧 Assumptions (for simplicity):
- Scalar hidden state and scalar outputs (you can later vectorize).
- Inputs: \(x_1, x_2, x_3 \in \mathbb{R}\)
- Targets: \(y_1, y_2, y_3 \in \mathbb{R}\)
- Initial hidden state: \(h_0 = 0\)
- Recurrent weight: \(W_h\)
- Input weight: \(W_x\)
- Output weight: \(W_y\)
- Biases: \(b_h, b_y\)
Forward Pass (Expanded for t = 1 to 3)
🔹 t = 1:
\(\displaystyle a_1 = W_x x_1 + W_h h_0 + b_h = W_x x_1 + b_h\)
\(\displaystyle h_1 = \tanh(a_1)\)
\(\displaystyle z_1 = W_y h_1 + b_y\)
\(\displaystyle \hat{y}_1 = \phi(z_1)\)
\(\displaystyle \mathcal{L}_1 = \frac{1}{2} (\hat{y}_1 - y_1)^2\)
🔹 t = 2:
\(\displaystyle a_2 = W_x x_2 + W_h h_1 + b_h\)
\(\displaystyle h_2 = \tanh(a_2)\)
\(\displaystyle z_2 = W_y h_2 + b_y\)
\(\displaystyle \hat{y}_2 = \phi(z_2)\)
\(\displaystyle \mathcal{L}_2 = \frac{1}{2} (\hat{y}_2 - y_2)^2\)
🔹 t = 3:
\(\displaystyle a_3 = W_x x_3 + W_h h_2 + b_h\)
\(\displaystyle h_3 = \tanh(a_3)\)
\(\displaystyle z_3 = W_y h_3 + b_y\)
\(\displaystyle \hat{y}_3 = \phi(z_3)\)
\(\displaystyle \mathcal{L}_3 = \frac{1}{2} (\hat{y}_3 - y_3)^2\)
🎯 Loss
Let total loss be: \(\displaystyle \mathcal{L} = \frac{1}{2} (\hat{y}_1 - y_1)^2 + \frac{1}{2} (\hat{y}_2 - y_2)^2 + \frac{1}{2} (\hat{y}_3 - y_3)^2\)
Backward Pass – Step-by-Step
(For compactness, the derivatives below treat the output activation \(\phi\) as the identity, so \(\partial \hat{y}_t / \partial z_t = 1\); with a sigmoid output, each term gains an extra factor \(\hat{y}_t(1 - \hat{y}_t)\).)
Step 1: Gradient w.r.t. output weights
\(\displaystyle \frac{\partial \mathcal{L}}{\partial W_y} = (\hat{y}_1 - y_1) \cdot h_1 + (\hat{y}_2 - y_2) \cdot h_2 + (\hat{y}_3 - y_3) \cdot h_3\)
\(\displaystyle \frac{\partial \mathcal{L}}{\partial b_y} = (\hat{y}_1 - y_1) + (\hat{y}_2 - y_2) + (\hat{y}_3 - y_3)\)
Gradient w.r.t. \(W_x\)
We’ll expand:
\(\displaystyle\Rightarrow\frac{\partial \mathcal{L}}{\partial W_x} = \frac{\partial \mathcal{L}_1}{\partial W_x} + \frac{\partial \mathcal{L}_2}{\partial W_x} + \frac{\partial \mathcal{L}_3}{\partial W_x}\)
🔹 From \(t = 3\)
We have:
\(\displaystyle\Rightarrow \frac{\partial \mathcal{L}_3}{\partial W_x} = \left[(\hat{y}_3 - y_3) \cdot W_y \cdot (1 - h_3^2) \cdot \frac{\partial a_3}{\partial h_2} \cdot \frac{\partial h_2}{\partial a_2} \cdot \frac{\partial a_2}{\partial h_1} \cdot \frac{\partial h_1}{\partial a_1} \cdot \frac{\partial a_1}{\partial W_x}\right] + \left[(\hat{y}_3 - y_3) \cdot W_y \cdot (1 - h_3^2) \cdot \frac{\partial a_3}{\partial h_2} \cdot \frac{\partial h_2}{\partial a_2} \cdot \frac{\partial a_2}{\partial W_x} \right] + \left[(\hat{y}_3 - y_3) \cdot W_y \cdot (1 - h_3^2) \cdot \frac{\partial a_3}{\partial W_x}\right]\)
Breakdown:
- \(\frac{\partial a_3}{\partial h_2} = W_h\)
- \(\frac{\partial h_2}{\partial a_2} = 1 - h_2^2\)
- \(\frac{\partial a_2}{\partial h_1} = W_h\)
- \(\frac{\partial h_1}{\partial a_1} = 1 - h_1^2\)
- \(\frac{\partial a_1}{\partial W_x} = x_1\)
- \(\frac{\partial a_2}{\partial W_x} = x_2\)
- \(\frac{\partial a_3}{\partial W_x} = x_3\)
Putting it all together:
\(\displaystyle\Rightarrow\frac{\partial \mathcal{L}_3}{\partial W_x} =\left[(\hat{y}_3 - y_3) \cdot W_y \cdot (1 - h_3^2) \cdot W_h \cdot (1 - h_2^2) \cdot W_h \cdot (1 - h_1^2) \cdot x_1\right] + \left[(\hat{y}_3 - y_3) \cdot W_y \cdot (1 - h_3^2) \cdot W_h \cdot (1 - h_2^2) \cdot x_2\right] + \left[(\hat{y}_3 - y_3) \cdot W_y \cdot (1 - h_3^2) \cdot x_3\right]\)
🔹 From \(t = 2\)
\(\displaystyle\Rightarrow\frac{\partial \mathcal{L}_2}{\partial W_x} =\left[(\hat{y}_2 - y_2) \cdot W_y \cdot (1 - h_2^2) \cdot W_h \cdot (1 - h_1^2) \cdot x_1 \right]+ \left[(\hat{y}_2 - y_2) \cdot W_y \cdot (1 - h_2^2) \cdot x_2\right]\)
🔹 From \(t = 1\)
\(\displaystyle\Rightarrow \frac{\partial \mathcal{L}_1}{\partial W_x} = (\hat{y}_1 - y_1) \cdot W_y \cdot (1 - h_1^2) \cdot x_1\)
✅ Final Expression for \(\frac{\partial \mathcal{L}}{\partial W_x}\)
\(\displaystyle\Rightarrow \frac{\partial \mathcal{L}}{\partial W_x} =\left[(\hat{y}_1 - y_1) \cdot W_y \cdot (1 - h_1^2) \cdot x_1 \right]\\ + \left[(\hat{y}_2 - y_2) \cdot W_y \cdot (1 - h_2^2) \cdot x_2 \right]\\+ \left[(\hat{y}_2 - y_2) \cdot W_y \cdot (1 - h_2^2) \cdot W_h \cdot (1 - h_1^2) \cdot x_1 \right]\\+ \left[(\hat{y}_3 - y_3) \cdot W_y \cdot (1 - h_3^2) \cdot x_3 \right]\\+\left[(\hat{y}_3 - y_3) \cdot W_y \cdot (1 - h_3^2) \cdot W_h \cdot (1 - h_2^2) \cdot x_2 \right] \\+ \left[(\hat{y}_3 - y_3) \cdot W_y \cdot (1 - h_3^2) \cdot W_h \cdot (1 - h_2^2) \cdot W_h \cdot (1 - h_1^2) \cdot x_1\right]\)
✅ Gradient w.r.t. Bias \(b_h\)
Just replace each \(x_i\) with 1 in the above expressions (because \(\frac{\partial a_t}{\partial b_h} = 1\))
So:
\(\displaystyle\Rightarrow \frac{\partial \mathcal{L}}{\partial b_h} = (\hat{y}_1 - y_1) \cdot W_y \cdot (1 - h_1^2) \\ + (\hat{y}_2 - y_2) \cdot W_y \cdot (1 - h_2^2) \\ + (\hat{y}_2 - y_2) \cdot W_y \cdot (1 - h_2^2) \cdot W_h \cdot (1 - h_1^2) \\ + (\hat{y}_3 - y_3) \cdot W_y \cdot (1 - h_3^2) \\ + (\hat{y}_3 - y_3) \cdot W_y \cdot (1 - h_3^2) \cdot W_h \cdot (1 - h_2^2) \\ + (\hat{y}_3 - y_3) \cdot W_y \cdot (1 - h_3^2) \cdot W_h \cdot (1 - h_2^2) \cdot W_h \cdot (1 - h_1^2)\)
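A compact numeric sanity check of the closed-form \(\partial\mathcal{L}/\partial W_x\) above, using scalar states as assumed here and an identity output activation \(\phi\); all the numbers are arbitrary, chosen only to exercise the formula.

```python
# Scalar many-to-many sketch: compare dL/dWx from the closed-form sum above
# with a central finite-difference estimate (phi = identity, as noted before Step 1).
import numpy as np

Wx, Wh, Wy, bh, by = 0.4, 0.3, 0.5, 0.1, 0.0   # arbitrary scalar parameters
xs = [0.2, -0.7, 1.1]                           # arbitrary inputs x_1..x_3
ys = [0.5, -0.1, 0.8]                           # arbitrary targets y_1..y_3

def forward(Wx):
    h, hs, loss = 0.0, [], 0.0
    for x, y in zip(xs, ys):
        h = np.tanh(Wx * x + Wh * h + bh)
        hs.append(h)
        y_hat = Wy * h + by                     # phi = identity
        loss += 0.5 * (y_hat - y) ** 2
    return loss, hs

loss, (h1, h2, h3) = forward(Wx)
yh = [Wy * h + by for h in (h1, h2, h3)]

# Closed-form gradient: every path from each loss term back to Wx.
dWx = ((yh[0] - ys[0]) * Wy * (1 - h1**2) * xs[0]
     + (yh[1] - ys[1]) * Wy * ((1 - h2**2) * xs[1]
                               + (1 - h2**2) * Wh * (1 - h1**2) * xs[0])
     + (yh[2] - ys[2]) * Wy * ((1 - h3**2) * xs[2]
                               + (1 - h3**2) * Wh * (1 - h2**2) * xs[1]
                               + (1 - h3**2) * Wh * (1 - h2**2) * Wh * (1 - h1**2) * xs[0]))

eps = 1e-6
numeric = (forward(Wx + eps)[0] - forward(Wx - eps)[0]) / (2 * eps)
print(round(float(dWx), 8), round(float(numeric), 8))   # the two values should agree
```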
6. RNN problems and their solutions
6.1. Vanishing Gradient Problem
The issue:
During Backpropagation Through Time (BPTT), gradients are calculated by chaining partial derivatives through time steps. That chain includes terms like derivatives of activation functions (e.g., tanh' = 1 - tanh²(x)), which are < 1, leading to repeated multiplication of small numbers.
Result:
After many time steps, these gradients shrink exponentially, and the network stops learning from earlier time steps — even if they’re important.
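This shrinking product can be seen numerically: the gradient reaching time step 1 contains a factor of the form \(W_h (1 - h_t^2)\) for every intervening step, and each factor is typically below 1 in magnitude. A tiny sketch with arbitrary values:

```python
# How the gradient contribution from the earliest step shrinks with sequence length.
import numpy as np

Wh = 0.5                                                   # arbitrary recurrent weight
h_vals = np.random.default_rng(1).uniform(0.3, 0.9, 100)   # stand-ins for tanh outputs

for T in (1, 10, 50, 100):
    factor = np.prod(Wh * (1 - h_vals[:T] ** 2))
    print(f"T={T:3d}   product of Wh*(1 - h_t^2) terms ≈ {factor:.3e}")
# The product collapses toward zero as T grows, so early inputs stop influencing
# the weight updates: the vanishing gradient problem.
```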
🛠 Mitigations:
- LSTM / GRU: They maintain a memory cell with gates to decide what to keep or forget. These architectures are designed to carry gradients across long distances without vanishing.
- ReLU-based activations (with caution): Some variants like ReLU (instead of tanh/sigmoid) help because their derivative is 1 (or 0), so gradients don’t vanish as easily.
6.2. Exploding Gradient Problem
What happens:
When the weights in the RNN are large or poorly initialized, the gradient can grow exponentially through time steps.
Effect:
Loss becomes NaN, weights diverge, or model doesn’t converge.
🛠 Mitigation:
Gradient Clipping: Limit the gradient magnitude during backprop.

```python
import tensorflow as tf

gradients = [tf.clip_by_value(grad, clip_value_min=-1.0, clip_value_max=1.0)
             for grad in gradients]
```
Keeps training stable even in long sequences.
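Keras optimizers can also do the clipping for you; a minimal sketch (threshold chosen arbitrarily, reusing any compiled model such as `model1` from earlier):

```python
from tensorflow.keras.optimizers import Adam

# clipnorm rescales any gradient whose L2 norm exceeds 1.0;
# clipvalue=1.0 would instead clip each gradient element-wise.
optimizer = Adam(learning_rate=1e-3, clipnorm=1.0)
model1.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
```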
6.3. Long-Term Dependencies
Problem:
Imagine you’re reading this sentence:
“The cat, which was chased by the dog, climbed the tree because it was scared.”
To understand what “it” refers to, you need to remember “cat”, even though there’s a lot of information in between. That’s a long-term dependency — something that happened much earlier still affects what comes later.
Even if vanishing gradients are controlled, simple RNNs are still bad at remembering information from many steps ago. This is due to their memoryless architecture — they overwrite their hidden state at every step.
Each hidden state \(\mathbf{O}_t\) is a function of:
\[ \mathbf{O}_t = \tanh(\mathbf{a}_t) = \tanh \left[W_{\text{input}} \cdot \mathbf{x}_t + W_h \cdot \mathbf{O}_{t-1} + \mathbf{b}_h\right] \]
Theoretically, \(\mathbf{O}_t\) should carry forward everything important. But in practice:
Each step overwrites the hidden state.
The gradients from early steps vanish or explode during backpropagation.
That means:
The model can’t learn to retain old information if it spans over many time steps.
It ends up focusing mostly on recent inputs.
🛠 Mitigation:
- LSTM: It uses a cell state and forget gates, which help it remember key information for long durations.
- GRU: Similar idea, fewer gates (update + reset), more efficient.
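In Keras, swapping the SimpleRNN cell for a gated one is essentially a one-line change; a sketch with arbitrary sizes:

```python
# Swapping the recurrent cell: same pipeline, gated cells instead of SimpleRNN.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

lstm_model = Sequential([
    Embedding(input_dim=10000, output_dim=32, mask_zero=True),
    LSTM(32),                       # cell state + gates carry long-range information
    Dense(1, activation='sigmoid')
])
# GRU(32) from tensorflow.keras.layers works the same way with fewer gates/parameters.
```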
6.4. Slow Training / High Complexity
Why it’s slow:
- Sequential nature: RNNs must process one step at a time. Can’t leverage parallelism across time steps.
- Backprop Through Time: Costly because gradients need to flow through each time step.
🛠 Mitigation:
- Sequence bucketing: Group similar-length sequences to reduce padding.
- Truncated BPTT: Backpropagate only for a limited number of past steps.
- Layer Normalization / Batch Norm in RNNs.
- Switch to non-recurrent models (like Transformers) where applicable.
6.5. Difficult to Capture Global Context
Example:
In text generation, an RNN might forget who the subject was several sentences ago, leading to mismatched verb tense or pronouns.
🛠 Mitigation:
- Attention Mechanisms: Let the model look back at all previous time steps with learnable weights.
- E.g., in seq2seq models, attention lets the decoder choose relevant encoder states.
- Transformers take this to the next level by replacing recurrence with multi-head self-attention, allowing direct access to any part of the input sequence.
6.6. Difficult to Train Deep RNNs
Why:
Stacking multiple RNN layers makes gradient flow worse, leading to instability.
🛠 Mitigation:
- Residual connections across layers.
- Layer normalization.
- Use GRU/LSTM, which inherently manage better depth handling.
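A sketch of a two-layer recurrent stack (arbitrary sizes): every recurrent layer except the last needs `return_sequences=True` so the next layer receives the full sequence.

```python
# Two stacked recurrent layers: all but the last need return_sequences=True.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, GRU, LayerNormalization, Dense

deep_model = Sequential([
    Embedding(input_dim=10000, output_dim=32),
    GRU(64, return_sequences=True),     # pass the whole sequence to the next layer
    LayerNormalization(),               # helps stabilize deeper recurrent stacks
    GRU(64),                            # final recurrent layer returns only h_T
    Dense(1, activation='sigmoid')
])
```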
6.7. Summary Table
| Problem | Description | Solution |
|---|---|---|
| Vanishing Gradient | Gradients shrink through time | LSTM / GRU, ReLU, Layer Norm |
| Exploding Gradient | Gradients grow exponentially | Gradient Clipping |
| Long-Term Dependency | Can’t retain long-range info | LSTM / GRU |
| Slow Training | Sequential nature limits parallelism | Truncated BPTT, Bucketing |
| Hard to Capture Context | Loss of long-range relevance | Attention, Transformers |
| Deep RNNs hard to train | Gradient instability | Residuals, Layer Norm, Gated Units |
7. RNN code example
In this example, we use the IMDB dataset from the Keras datasets library.
This is a dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). The reviews have been preprocessed, and each review is encoded as a list of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
As a convention, "0" does not stand for a specific word; it is used to encode the padding token.
```python
#1. Importing libraries and loading the data
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, SimpleRNN, Embedding

# Load Data (keep only the 10,000 most frequent words)
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=10000)

# Printing 1st sample and shapes
print("X_train[0] = ", X_train[0][:10], "\n", "y_train[0] = ", y_train[0])
print("X_train.shape = ", X_train.shape, "\n", "y_train.shape = ", y_train.shape)
print("X_test.shape = ", X_test.shape, "\n", "y_test.shape = ", y_test.shape)
```

X_train[0] = [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]
y_train[0] = 1
X_train.shape = (25000,)
y_train.shape = (25000,)
X_test.shape = (25000,)
y_test.shape = (25000,)
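The class-count output below appears without its code cell in the original; it presumably comes from a quick label-balance check along these lines (a reconstruction, not the original cell):

```python
#2. Checking the class balance of the labels (reconstructed cell)
import pandas as pd

print(pd.Series(y_train).value_counts())
```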
1 12500
0 12500
Name: count, dtype: int64
```python
#3. Padding to make all sequences the same length
import numpy as np

max_len = 50
X_train = pad_sequences(X_train, padding='post', maxlen=max_len)
X_test = pad_sequences(X_test, padding='post', maxlen=max_len)

X_train = np.array(X_train, dtype='int32')
y_train = np.array(y_train, dtype='int32')
X_test = np.array(X_test, dtype='int32')
y_test = np.array(y_test, dtype='int32')

print(X_train.shape)
```

(25000, 50)
```python
#4. Preparing the model
max_vocab = 10000

model = Sequential([
    Embedding(input_dim=max_vocab + 1,   # +1 for the padding token
              output_dim=32,
              input_length=max_len,
              mask_zero=True),           # mask_zero tells the RNN to ignore padding values
    SimpleRNN(32),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.summary()
```

Model: "sequential_5"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_5 (Embedding) (None, 50, 32) 320032
simple_rnn_5 (SimpleRNN) (None, 32) 2080
dense_5 (Dense) (None, 1) 33
=================================================================
Total params: 322145 (1.23 MB)
Trainable params: 322145 (1.23 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
- Embedding layer: vocabulary_size = 10000, embedding_dim = 32
  (10000 + 1) * 32 = 10001 * 32 = 320032
- Hidden layer (SimpleRNN):
  - Input-to-hidden weights (W_xh): (input_dim, units) = (32, 32) → 32 * 32 = 1024
  - Hidden-to-hidden weights (W_hh): (units, units) = (32, 32) → 32 * 32 = 1024
  - Bias (b_h): (units,) = (32,) → 32
  - Total params = 1024 + 1024 + 32 = 2080
- Output layer (Dense): 32 (weights) + 1 (bias) = 33
```python
#5. Train the model
history = model.fit(
    X_train,
    y_train,
    epochs=5,
    batch_size=32,
    validation_data=(X_test, y_test)
)
```

Epoch 1/5
782/782 [==============================] - 26s 30ms/step - loss: 0.5095 - accuracy: 0.7376 - val_loss: 0.4301 - val_accuracy: 0.8026
Epoch 2/5
782/782 [==============================] - 20s 25ms/step - loss: 0.3276 - accuracy: 0.8614 - val_loss: 0.4414 - val_accuracy: 0.8016
Epoch 3/5
782/782 [==============================] - 20s 25ms/step - loss: 0.1897 - accuracy: 0.9291 - val_loss: 0.5418 - val_accuracy: 0.7896
Epoch 4/5
782/782 [==============================] - 20s 26ms/step - loss: 0.0772 - accuracy: 0.9743 - val_loss: 0.6882 - val_accuracy: 0.7742
Epoch 5/5
782/782 [==============================] - 21s 26ms/step - loss: 0.0351 - accuracy: 0.9896 - val_loss: 0.8471 - val_accuracy: 0.7860
The model is overfitting: training accuracy keeps climbing while validation loss rises and validation accuracy stalls or drops. A couple of standard remedies are sketched below.
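Two common remedies, sketched with arbitrary hyperparameters (not tuned for this dataset): dropout inside the recurrent layer and early stopping on validation loss.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
from tensorflow.keras.callbacks import EarlyStopping

reg_model = Sequential([
    Embedding(input_dim=max_vocab + 1, output_dim=32, mask_zero=True),
    SimpleRNN(32, dropout=0.3, recurrent_dropout=0.3),   # drop input / recurrent units
    Dense(1, activation='sigmoid')
])
reg_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)
reg_model.fit(X_train, y_train, epochs=10, batch_size=32,
              validation_data=(X_test, y_test), callbacks=[early_stop])
```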
8. Gradient Flow Visualization
%%{init: {"theme":"neutral", "flowchart": {"nodeSpacing": 15, "rankSpacing": 70, "height":1}}}%%
graph LR
subgraph Inputs
direction LR
style Inputs fill:#a7bde0,stroke:#64b5f6,font-size:30,stroke-width:2px
x1["x<sub>t-1</sub> (Token ID)"]
x2["x<sub>t</sub> (Token ID)"]
x3["x<sub>t+1</sub> (Token ID)"]
end
subgraph Embedding
direction LR
style Embedding fill:#d9b3e6,stroke:#ba68c8,font-size:30,stroke-width:2px
e1["e<sub>t-1</sub>"]
e2["e<sub>t</sub>"]
e3["e<sub>t+1</sub>"]
end
subgraph RNN
direction LR
style RNN fill:#a7e0b3,stroke:#85ff9f,font-size:30,stroke-width:2px
h1("h<sub>t-1</sub>")
h2("h<sub>t</sub>")
h3("h<sub>t+1</sub>")
end
subgraph Output
direction TB
style Output fill:#e78383,stroke:#f8cc52,font-size:30,stroke-width:2px
y("ŷ \n(Sentiment)")
end
%% Input to Embedding
x1 --> |Lookup| e1
x2 --> |Lookup| e2
x3 --> |Lookup| e3
%% RNN Connections
e1 --> |W<sub>e</sub>| h1
h1 --> |W<sub>h</sub>| h2
e2 --> |W<sub>e</sub>| h2
h2 --> |W<sub>h</sub>| h3
e3 --> |W<sub>e</sub>| h3
%% Final hidden to Output
h3 --> |W<sub>y</sub>| y
y --> Loss
%% Optional: Recurrent arrows (dashed)
h1 -.-> |h<sub>t-1</sub>| h2
h2 -.-> |h<sub>t</sub>| h3
8.1 Backpropagation in This RNN Model
Workflow:
Forward Pass:
- Inputs (word indices) are converted to vectors using the Embedding layer.
- These vectors are passed through the SimpleRNN.
- The output goes into a Dense layer with sigmoid, producing the final prediction.
Loss is computed using binary cross-entropy.
Backward Pass (Backpropagation Through Time - BPTT):
- Gradients from the loss are passed backward through the Dense → RNN → Embedding layers.
- Gradients flow from:
- Output → Dense layer weights
- Dense → RNN weights (input, recurrent)
- RNN → Embedding layer
How Embeddings Get Updated
- The Embedding layer acts like a lookup table: each word index corresponds to a row vector (of 32 dimensions here).
- During backpropagation:
- Only the vectors for the words present in the input are updated.
- Gradients flow from the RNN into the embedding vectors.
- This is just like updating weights in any other layer — embeddings are trainable by default.
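A way to see this directly (a sketch, assuming the trained `model` and the padded `X_train`/`y_train` from the example above): compute the gradient of the loss with respect to the embedding matrix using `tf.GradientTape` and check which rows are nonzero.

```python
import numpy as np
import tensorflow as tf

x_batch = X_train[:1]                         # one padded review, shape (1, 50)
y_batch = y_train[:1].astype('float32')

with tf.GradientTape() as tape:
    y_pred = model(x_batch, training=True)                        # shape (1, 1)
    loss = tf.keras.losses.binary_crossentropy(y_batch, tf.squeeze(y_pred, axis=-1))

embeddings = model.layers[0].trainable_variables[0]               # (10001, 32) lookup table
grad = tf.convert_to_tensor(tape.gradient(loss, embeddings))      # densify the sparse gradient

# Only rows for word indices present in this review receive a nonzero gradient
# (the masked padding index 0 stays untouched).
rows_updated = np.unique(np.nonzero(grad.numpy())[0])
print("embedding rows with nonzero gradient:", len(rows_updated), "of", embeddings.shape[0])
print("distinct word indices in the review:  ", len(np.unique(x_batch)))
```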